Abstract

We introduce precision-biased parsing: a pars- 
ing task which favors precision over recall 
by allowing the parser to abstain from de- 
cisions deemed uncertain. We focus on 
dependency-parsing and present an ensemble 
method which is capable of assigning parents 
to 84% of the text tokens while being over 
96% accurate on these tokens. We use the 
precision-biased parsing task to solve the re- 
lated high-quality parse-selection task: find- 
ing a subset of high-quality (accurate) trees in 
a large collection of parsed text. We present 
a method for choosing over a third of the in- 
put trees while keeping unlabeled dependency 
parsing accuracy of 97% on these trees. We 
also present a method which is not based on 
an ensemble but rather on directly predicting 
the risk associated with individual parser deci- 
sions. In addition to its efficiency, this method 
demonstrates that a parsing system can pro- 
vide reasonable estimates of confidence in its 
predictions without relying on ensembles or 
aggregate corpus counts. 



1 Introduction and Methodology 

Parsing technology has made great progress over the last decade, and current state-of-the-art parsers for English have reported accuracies in the low 90%'s. Current parsing systems are designed to provide a complete parse to every sentence, and are evaluated based on their average number of correct decisions over a test corpus. Such evaluation, however, does not tell a complete story, as the mistakes are never uniformly distributed across sentences. In practice, parsers usually perform very well on some sentences, but also perform very poorly on others. In addition, parsers do not provide confidence estimates in their predictions, making it hard for downstream applications to rely on parsers' output, even if the average parsing quality is high.

We advocate a modification to the parsing task, 
which we term precision-biased parsing. Rather 
than providing a complete parse to every sentence, 
we advocate providing partial analyses to some sen- 
tences, while trying to guarantee that the structures 
that are provided are of high quality. In other words, 
we advocate parsing systems which are able to trade 
recall for precision. The trade-off can occur either 
at the sentence level (abstaining from providing a 
parse to some sentences) or at the individual attach- 
ment level (abstaining from attaching some words or 
phrases to the rest of the sentence in case the attach- 
ment is uncertain). 

Such a trade-off is useful for tasks which rely on precise structures.^1 These include information extraction and question answering systems, sentence simplification and summarization systems and syntax-based translation, as well as linguistically-oriented tasks such as learning selectional preferences, case frames or lexical ontologies. In addition, partial but precise output may prove useful for self-training (McClosky et al., 2006), up-training (Petrov et al., 2010) and active-learning setups.

Some previous efforts (Alexander Yates and Etzioni, 2006; Reichart and Rappoport, 2007) attempt to identify high-quality parse trees among a parser's output. This is an instance of trading recall for precision at the sentence level. Here, we focus on trading recall for precision at the individual attachment level. As discussed in Section 3, solving the problem at the individual attachment level entails a natural solution to the problem at the sentence level as well. Moreover, we believe there is benefit in solving the problem at the attachment level: useful information can be extracted from a partial parse tree, even when some attachment decisions are missing or marked as unreliable.

1 The other direction, of trading precision for recall, is also of interest. A solution to this inverse problem is already available to some extent in the form of k-best parsing and packed forests.

The precision-biased task was explored in the past in the context of parsers based on manually-developed grammars (Carroll and Briscoe, 2002; Watson et al., 2005). However, state-of-the-art data-driven statistical parsers do not allow trading recall for precision. In this paper, we focus on data-driven dependency parsing.

We begin by defining the precision-biased parsing task and its evaluation measures (Section 2), and provide a strong baseline based on parse-ensembles (Section 5). While effective, the ensemble system requires substantial computational effort, and is of little theoretical interest as it is well known that committee (dis)agreement is a good indicator of confidence. We propose another method based on parser-modeling in Section 6. We train a probabilistic classifier to predict the risk associated with attachment decisions in the parser's output. The classifier learns the error patterns of the parser, and assigns a reliability score to parse edges. By thresholding these reliability scores, we can effectively trade recall for precision. The method comes close to the ensemble-based baseline in terms of precision and coverage, while running faster and providing a straightforward way to control the recall/precision tradeoff. More importantly, it demonstrates that reasonable confidence estimates of the correctness of parser predictions can be attained without relying on committee, diversity, or aggregate corpus-based counts of recurring structures. Inspecting the behavior of the learned model on PP-attachment, a classic case of syntactic ambiguity, reveals that it is not judged by the model as being categorically hard. Rather, some PP-attachment cases are marked as unreliable, while others are not.



2 Precision-biased Dependency Parsing 

In the traditional dependency parsing task, the in- 
put is a set of sentences (the test corpus), and we 
are interested in the most precise analysis for each 
sentence. The task performance is measured based 
on the average number of tokens that got assigned a 
correct parent over the entire test corpus. Crucially, 
the parsing process in this task must assign a parent 
to each of the tokens in the test corpus. 

In contrast, in the precision-biased task we are 
concerned with precision more than recall. We al- 
low the parsing system to abstain from providing an 
analysis to some of the input by skipping some deci- 
sions. That is, we require the parser to assign parents 
to as many of the input tokens as possible, but allow 
it to leave the parents of some tokens unassigned. 
Metrics Let $T$ be the set of input tokens, $A$ be the set of tokens that got assigned a parent, and $S$ be the set of tokens that were not assigned a parent ($T = A \cup S$, $A \cap S = \emptyset$). Let $C \subseteq A \subseteq T$ be the set of tokens that got assigned a correct parent. Then:

$$\text{precision} = \frac{|C|}{|A|} \qquad \text{recall} = \frac{|C|}{|T|} \qquad \text{coverage} = \frac{|A|}{|T|}$$

By requiring complete coverage of the input tokens ($\text{coverage} = \frac{|A|}{|T|} = 1$) we get $A = T$, and then $\text{precision} = \text{recall} = \text{accuracy}$, where accuracy is the traditional dependency parsing accuracy.

The precision-biased setting allows coverage < 1. The aim is to maximize precision, while still retaining sufficient coverage of the input tokens.^2
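As a minimal sketch of these measures (not part of the original system), the following Python function computes precision, recall and coverage for one token sequence. The function name, the list-based representation of parents, and the use of None to mark an abstention are assumptions made for illustration.

```python
def precision_biased_scores(gold, predicted):
    """Compute precision, recall and coverage for one token sequence.

    gold:      list of gold parent indices, one per token (the set T).
    predicted: list of predicted parent indices, with None where the
               parser abstained (hypothetical encoding of the set S).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    total = len(gold)                                    # |T|
    assigned = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in assigned if g == p)      # |C|
    return {
        "precision": correct / len(assigned) if assigned else 0.0,  # |C| / |A|
        "recall": correct / total if total else 0.0,                # |C| / |T|
        "coverage": len(assigned) / total if total else 0.0,        # |A| / |T|
    }
```

When no entry of `predicted` is None, coverage is 1 and precision equals recall, which is the traditional accuracy.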

Parse-selection Previous work addresses the parse-selection task: selecting a subset of the input sentences for which we have high-accuracy parses. This is an instance of precision-biased parsing in which abstaining on a token requires abstaining also on all the other tokens in the same sentence. When discussing the parse-selection task we distinguish between token coverage, which is identical to coverage as defined above, and sentence coverage, which is the number of selected sentences divided by the total number of input sentences. The definitions of precision and recall remain as above.



2 We chose to balance precision against coverage rather than against recall because coverage is an upper bound on recall: if precision = 1 then coverage = recall.



3 Our Approach 

We tackle precision-biased parsing by defining a 
riskiness function on individual attachment deci- 
sions. The riskiness R(tok,par) of a token/parent 
pair is the inverse of our confidence in the attach- 
ment decision. A high riskiness indicates uncer- 
tainty in the decision, and low riskiness indicates 
that we believe the decision to be correct. We then 
set a riskiness-threshold, and abstain from any par- 
ent assignment for which the riskiness is above the 
threshold. Setting a low riskiness-threshold results 
in higher precision (considering even low-risk de- 
cisions as too risky), and setting a high riskiness- 
threshold results in higher coverage. 

Parse-selection Having defined a risk function and a risk-threshold, we get a natural selection criterion for the parse-selection task: high-quality parses are those for which at most K attachments are above the riskiness threshold. The precision/coverage balance can be controlled either by changing the risk threshold, or by changing K.
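The two uses of the riskiness function described in this section can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, riskiness values are assumed to be floats supplied by some estimator, and an abstention is encoded as None.

```python
def abstain_risky(parents, risks, threshold):
    """Edge-level use: keep a parent only if its riskiness is not above the threshold."""
    return [p if r <= threshold else None for p, r in zip(parents, risks)]


def select_parses(sentence_risks, threshold, k):
    """Sentence-level use: indices of sentences with at most k risky attachments.

    sentence_risks: one list of per-token riskiness values per sentence.
    """
    selected = []
    for i, risks in enumerate(sentence_risks):
        n_risky = sum(1 for r in risks if r > threshold)
        if n_risky <= k:
            selected.append(i)
    return selected
```

Lowering `threshold` or `k` makes both functions more conservative, which corresponds to the higher-precision, lower-coverage end of the trade-off.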

3.1 Related Work 

While little research attention was dedicated to the precision-biased task (Carroll and Briscoe, 2002; Watson et al., 2005), several studies address the parse-selection task. Yates et al. (2006) perform parse-selection by filtering out parses containing "semantically implausible" relations, where semantic plausibility is estimated by high co-occurrence of the words in the relation in a large corpus (i.e., the web).

Reichart and Rappoport (2007) perform committee-based selection of high-quality constituency parses by calculating an agreement measure between 20 copies of a lexicalized parser trained on different subsets of a training corpus.



Sagae and Tsujii (2007) select high-quality dependency parses by using two dependency parsers and selecting only sentences on which both parsers agree on the entire parse.

Kawahara and Uchimoto (2008) identify high-quality dependency parses by training a classifier based on sentence-level features: sentence length, average dependency length, number of unknown words, number of commas and conjunctions, and corpus frequencies of sentence words.



Reichart and Rappoport (2009) identify high-quality parses of an unsupervised parser by looking for parses with many reliable constituents, where reliability of a constituent is calculated based on the number of times its POS-sequence appears in the automatically parsed text. Finally, Dell'Orletta et al. (2011) assign quality-scores to dependency parses using a metric which measures various syntactic properties of the parse tree and compares them to the aggregate measurements over the entire parsed corpus.

To summarize, there are three lines of work addressing the parse-selection task: selecting parses based upon agreement between a committee of parsers, selecting parses based on agreement between the parses and aggregate counts over a large corpus (either lexicalized "semantic" agreement or syntactic agreement), and selecting parses based on sentence-specific features (length, vocabulary, number of commas and so on).

In contrast, we are primarily interested in select- 
ing high-quality edges rather than complete parses. 
We view parse-selection as an extension of the 
precision-biased parsing task, and perform parse- 
selection based on the number of risky attachment 
decisions. Our assessment of the riskiness or relia- 
bility of a particular decision is not based on aggre- 
gate corpus counts nor on global features of the in- 
put sentence (though such kinds of information may 
be integrated in the future). In our first method, we 
adopt a committee-based approach, but apply it pri- 
marily for edge-selection. In the second method we 
present below, we investigate features which may 
help the parser assess edge riskiness. We note that the marginal edge probabilities obtained from a log-linear parsing model as in (Smith and Smith, 2001) are not reliable predictors of edge riskiness: indeed, the pruning procedure used in (Carreras et al., 2008) considers edges with marginal scores of up to $10^{-6}$ of the highest scoring edge as possible candidates in order to ensure sufficient coverage, indicating that such models may greatly overestimate the marginals of incorrect edges, while underestimating the marginal values of correct edges.

Kawahara (2001) presents an automatic method for Case-Frame dictionary construction for Japanese. The method identifies verb case-frames by identifying reliable syntactic constructions, where the reliability is learned using a hand-crafted heuristic and aggregate corpus counts. This demonstrates the usefulness of identifying reliable instances of specific constructions. Our proposal is to identify reliable instances of many different constructions, without relying on hand-crafted heuristics.

4 Data 

Our experiments are based on the dependency version of the Penn WSJ corpus, as converted using the Penn2Malt^3 software with Collins' head-rules. The data is POS-tagged using the HMM-based Hunpos tagger.^4

While our work is not directly comparable to any previous work,^5 this presents an opportunity to stop
following the standard train/test/dev splits, and in 
particular to stop testing only on section 23. In- 
stead, we adopt a setup in which we use sections 2- 
11 (about 18k sentences) for training the parser(s), 
sections 12-15 (8900 sentences) for training the 
riskiness-estimator (where appropriate), section 16 
(2780 sentences) for development and sections 17- 
21 (9500 sentences) for testing. This setup leaves 
a reasonable amount of training data for the statis- 
tical models (the parser training set is roughly the 
same size as the one used in the CoNLL shared task), 
while retaining a much larger test set than the stan- 
dard setting. 

5 Parser-ensemble Riskiness Estimation 

Our first method of estimating the riskiness function is using an ensemble method. Ensemble methods have been shown to provide good results for dependency parsing (Sagae and Lavie, 2006; Hall et al., 2007), as well as for parse-selection as discussed above. Here, we use ensembles to estimate the riskiness of individual edges in a dependency tree. To estimate the riskiness, we parse the input sentence using k different parsers and take the intersection of their predictions. The riskiness of a token/parent pair is 0 if all k parsers agree on that prediction, and 1 otherwise. In the final output, we take only edges with riskiness 0.

We use an ensemble of 3 parsers: a linear-time shift-reduce parser as described in (Huang et al., 2009) (ShiftReduce), the globally optimized first-order projective dependency parser of (McDonald et al., 2005) (MST1), and the easy-first parser of (Goldberg and Elhadad, 2010) (EasyFirst). Such an ensemble was shown in (Goldberg and Elhadad, 2010) to provide good oracle accuracies, as well as state-of-the-art accuracies in a non-oracle setting due to the diversity among its parsers. The runtime of this ensemble is dominated by the $O(n^2)$ feature extraction stage and the $O(n^3)$ inference of the globally optimized MST1 parser.

3 http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html
4 http://code.google.com/p/hunpos/
5 We are not aware of previous work on the precision-biased task, while previous work for the parse-selection task either focuses on constituency structures, or uses non-standard datasets.
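The agreement-based riskiness of this section can be sketched in a few lines of Python. The representation is an assumption for illustration: each parser's output for a sentence is a list of parent indices, and the function names are hypothetical. The parsers themselves are not modeled here.

```python
def ensemble_riskiness(predictions):
    """Riskiness per token: 0 if all k parsers predict the same parent, 1 otherwise.

    predictions: list of k parent lists (one per parser) for the same sentence.
    """
    if not predictions:
        raise ValueError("need the output of at least one parser")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all parsers must cover the same tokens")
    first = predictions[0]
    return [0 if all(p[i] == first[i] for p in predictions) else 1
            for i in range(n)]


def agreed_edges(predictions):
    """Keep only edges with riskiness 0; None marks an abstention."""
    risks = ensemble_riskiness(predictions)
    return [parent if risk == 0 else None
            for parent, risk in zip(predictions[0], risks)]
```

Because the riskiness is binary, the only available operating point is full agreement; this sketch offers no threshold to tune.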

5.1 Results and Discussion 

The individual parsers' scores on the test set are 87.4 (ShiftReduce), 88.6 (MST1) and 88.4 (EasyFirst).

Precision-biased Scores The precision-biased 
scores of the ensemble system on the test-set are 
96% precision with a coverage of 84.2% (recall of 
80.8%). By not providing an analysis for about 15% 
of the input tokens, we get an impressive gain in 
precision. 

Parse-selection Scores As discussed above, we reduce parse-selection to risk-based precision-biased parsing by selecting parses with at most K risky attachments. Table 1 shows the precision and sentence-coverage on the development and test set for various values of K. A K value of 0 (forcing the parsers to agree on all edges) achieves a precision of 97.5 while covering about a quarter of the sentences in the test set. By allowing one disagreement between the parsers the precision drops to 95.0, but we gain a better sentence-coverage of about 36%. Increasing the value of K decreases the accuracy of the selected parses while increasing their quantity.



K    Precision (%)    Sentence-Coverage (%)
     dev / test       dev / test
0    97.5 / 97.8      23.2 / 24.6
1    95.0 / 95.6      36.2 / 36.8
2    93.0 / 93.4      47.2 / 47.3
3    91.0 / 91.3      56.9 / 57.1
4    89.4 / 89.5      64.5 / 66.0

Table 1: Ensemble-based parse-selection precision and coverage for various risk-cutoffs (K).



6 Single-parser Riskiness Estimation 

While the ensemble method is effective at the precision-biased task, it has two shortcomings: (1) it takes a long time to run due to the runtime complexity of the MST1 parser, and (2) it does not provide a way of tuning the precision/coverage balance.

Here we take a different route. We use a single 
parser (we use the EasyFirst parser for its bal- 
ance between speed and accuracy and its incremen- 
tal parse construction), and train a discriminative 
probabilistic classifier to predict the risk associated 
with its predictions. 

EasyFirst is a greedy parser that works by incrementally adding dependency edges in a bottom-up fashion (see (Goldberg and Elhadad, 2010) for the details). It is trained to take easy decisions before
harder ones, but does not provide confidence in its 
predictions: at a given step, the highest scoring ac- 
tion can still be very ambiguous, yet easier than the 
alternatives. The two measures of the best possible 
("easiest") action and the riskiness of an action are 
interrelated, but not identical. The easiest action, at 
a given stage, may still be risky. For example, con- 
sider the case in which the parser sees a configura- 
tion consisting of [Verb Noun Prep]. This is a PP 
attachment ambiguity, where the Prep should be the 
child of either the Noun or the Verb. Concretely, 
the parser should choose to either attach Noun un- 
der Verb and then Prep under Verb+Noun, or to first 
attach Prep under Noun and then Noun+Prep un- 
der Verb. At this stage, the two possible attach- 
ments (Verb+Noun and Noun+Prep) are risky (even 
though the Verb+Noun edge will turn out in the final 
parse in any case), but the parser should neverthe- 
less choose one of them. It will choose, based on its 
training experience and on the specific properties of 
the VP, NP and PP at hand, the action which it finds 
is most correct. This would be the least-risky attach- 
ment at a given stage, but it does not reflect directly 
on the objective riskiness of the decision at large. 

In the precision-biased setting, we are interested 
in assessing the objective riskiness of various parser 
decisions. 

Riskiness Predictor We train a separate classifier to assess the riskiness involved in each prediction. We interpret the riskiness as a probability function:

$$\text{risk}(\text{context}) = \Pr(\text{decision is wrong} \mid \text{context})$$

In words: the riskiness is the probability of the parser making a wrong choice in a given situation (context). The riskiness function does not necessarily depend on the actual decision (i.e., it should be interpreted as "when faced with situation X you are likely to make a mistake" rather than as "attaching $t_i$ below $t_j$ in situation X is likely to be wrong"), but the decision can be encoded in the context if desired.

We treat riskiness prediction as a binary classification task, and fit a Maximum Entropy model^6 based on training data as described in Section 7.1 below.
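A binary Maximum Entropy model is equivalent to logistic regression, so the riskiness predictor can be illustrated with a small pure-Python logistic regression over sparse binary features. This is only a sketch under stated assumptions: it uses plain stochastic gradient descent rather than the Megam optimizer, and the function names, feature strings and hyperparameters are hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_risk_model(examples, epochs=200, lr=0.1, l2=1e-4):
    """Fit weights for risk(context) = Pr(decision is wrong | context).

    examples: list of (features, is_wrong) pairs, where features is a list
              of feature strings and is_wrong is 1 for a wrong decision.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, is_wrong in examples:
            p = sigmoid(sum(weights[f] for f in feats))
            gradient = p - is_wrong          # derivative of the log-loss
            for f in feats:
                weights[f] -= lr * (gradient + l2 * weights[f])
    return dict(weights)


def predict_risk(weights, feats):
    """Estimated probability that the decision described by feats is wrong."""
    return sigmoid(sum(weights.get(f, 0.0) for f in feats))
```

The returned probability can be compared directly against a riskiness-threshold, which is what makes the precision/coverage balance tunable.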

We experimented with several alternative inter- 
pretations of the riskiness function, capturing differ- 
ent kinds of information (features). 

Riskiness of parser actions An interesting question 
that arises is whether the information available dur- 
ing parsing is sufficient for determining the risk as- 
sociated with a parsing decision, and which kinds of 
information are most useful. 

The first set of experiments attaches risk to parser actions. These aim to answer the question "can the parser assess the quality of its own actions?". Note that parser actions are not equivalent to attachment decisions: the easy-first parser may choose to attach a token to its correct parent and still be wrong, because it is not yet the correct time to do so (because the child node is not yet saturated), and this action, while resulting in one correct edge, prevents future correct edges from being added (consider the Verb+Noun edge in the PP-attachment example above). Thus, this set of experiments can be used only for the parse-selection task (selecting as good parses those for which there were fewer than K risky actions), and not for the precision-biased parsing task.

We experimented with the following feature sets:
Process-based features: action_process is a minimal set of 5 numerical features which relate only to the parsing process itself. These include: sentence-length, current number of parent-less tokens, score of the best action (to be applied), score of the second best action, and the difference between the scores of the best and second-best actions.
State-based features: action_state is a set of features which relate only to what the parser sees. Here, we use the exact same feature set which is used by the parser for predicting the scores of the various actions.

6 We use the Megam optimizer (Daumé III, 2004).
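For concreteness, the five process-based features listed above could be collected as in the sketch below. The function signature and the dictionary keys are hypothetical, and the parser is assumed to expose the scores of all candidate actions at the current step.

```python
def action_process_features(sentence_length, num_parentless, action_scores):
    """Build the five action_process features for one parser step.

    sentence_length: number of tokens in the sentence.
    num_parentless:  tokens that have not yet been assigned a parent.
    action_scores:   scores of all candidate actions at this step.
    """
    if len(action_scores) < 2:
        raise ValueError("need at least two candidate actions")
    ranked = sorted(action_scores, reverse=True)
    best, second = ranked[0], ranked[1]
    return {
        "sentence_length": sentence_length,
        "num_parentless": num_parentless,
        "best_score": best,
        "second_best_score": second,
        "score_margin": best - second,
    }
```

A small margin between the two top scores is the kind of signal that such a feature set can pass to the riskiness classifier.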

Riskiness of predicted edges The second set of ex- 
periments associates riskiness with edge predictions. 
That is, riskiness is interpreted as "what is the proba- 
bility of this particular predicted edge to be wrong". 
In contrast to the previous experiments, this defini- 
tion of riskiness addresses the full precision-biased 
parsing problem, by abstaining from providing (or 
ruling-out) attachments for edges that are considered 
too risky. 

We experimented with the following feature sets:
State-based features: as above, the edge_state feature set encodes exactly what the parser sees when making an attachment decision, i.e., the feature set used by the parser when scoring actions. However, here wrong parser actions which result in a correct edge are considered as correct (non-risky) examples.

Edge-factored features: the edge_factored feature set is not related to the parsing process, and can be extracted from the parse tree in a post-processing step. Here, we use the same features as used in Ryan McDonald's first-order edge-factored MST parser (McDonald et al., 2005).

Higher-order features: the edge_higher feature set does not depend on the parsing process, and uses more information than the edge-factored one: the features of a (token,parent) pair include information on the token and the parent, as well as on the siblings of the token, siblings of the parent, children of the token, and parent of the parent.

edge_state has only a negligible effect on the parsing time (as above, the features are already extracted by the parser), while edge_factored and edge_higher have a noticeable (though still small) effect on the parsing time by adding n feature extraction and scoring operations.

7 Experiments and Results 
7.1 Training 

We used the following procedure:
1. Train the easy-first parser on the parser-training set, and use it to parse the rest of the data (riskiness-training, test and dev, see Sec. 4) while keeping track of the parser's predictions.^8
2. Extract correct and incorrect decisions and their corresponding features (according to the definitions above) from the automatically parsed data.
3. Train a MaxEnt binary classifier on the riskiness-training set.

7.2 Evaluation 

ROC We begin by plotting the ROC curves for identifying risky decisions using the different MaxEnt risk predictors with varying risk-thresholds. Figure 2 presents the results. The curves are not entirely comparable: the two lower curves (action_process and action_state) identify risky parser actions, while the higher curves identify risky edge attachments. There is a clear hierarchy between the different predictors, but even the simplest ones are quite effective at identifying risky decisions. The predictors that attach riskiness to edges are more effective than those that attach riskiness to parsing actions, even when the same feature-set is used (edge_state vs. action_state), and the two predictors that use information external to the parser (edge_factored and edge_higher) are better than those using information internal to the parser. Still, it is interesting to note that the exact same feature-set which is available to the parser during parsing is sufficient to assess the riskiness of many of the decisions which are based on the same feature-set.

There is only a small difference between the edge_factored feature-set and the one including higher-order features (edge_higher): while the extra contextual information does help, most of the riskiness associated with an edge can already be determined based on the edge itself and sentence-level properties (without considering proposed surrounding edges).
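The points of such an ROC curve can be computed as sketched below, taking the true positive rate to be the fraction of incorrect decisions identified as risky and the false positive rate to be the fraction of correct decisions identified as risky. The function name and the input encoding are assumptions for illustration.

```python
def roc_points(risks, is_wrong, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold.

    risks:      predicted riskiness of each decision.
    is_wrong:   1 if the decision was actually incorrect, else 0.
    thresholds: riskiness-thresholds; a decision is flagged as risky
                when its riskiness is above the threshold.
    """
    n_wrong = sum(1 for w in is_wrong if w)
    n_right = len(is_wrong) - n_wrong
    if n_wrong == 0 or n_right == 0:
        raise ValueError("need both correct and incorrect decisions")
    points = []
    for t in thresholds:
        tp = sum(1 for r, w in zip(risks, is_wrong) if w and r > t)
        fp = sum(1 for r, w in zip(risks, is_wrong) if not w and r > t)
        points.append((fp / n_right, tp / n_wrong))
    return points
```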

Precision-biased Scores We now turn to evaluating the precision-biased results for the various predictors. As discussed above, these can be calculated only for the three edge-based riskiness predictors. For the precision-biased results, the parser abstains from predicting edges with an associated riskiness above a certain riskiness-threshold. Figure 1 plots the precision and coverage of the parser for varying riskiness-thresholds, using the three different riskiness-predictors. The third plot in the figure plots precision against coverage for the same predictors. The overall trends are similar to those observed in the ROC curves, though here the difference between edge_factored and edge_higher is somewhat more pronounced.

8 In case of a smaller treebank, a k-fold jackknifing scheme should be used.

Figure 1: Precision-biased results on the dev-set.

With appropriate riskiness thresholds we could achieve a coverage as high as 95%, or precision of above 97%. Unfortunately, we cannot get both: higher precision means lower coverage and vice versa. Compared to the ensemble-based riskiness estimation (96% precision with 84% coverage) the single-parser results are not as strong. The same level of coverage (84.6%) results in a precision of around 94.2%, and a precision of 96.5% leads to a coverage of 70%. A riskiness-threshold of 0.15 (edge_higher predictor) strikes a nice balance of just over 95% precision with 80% coverage. A coverage of 90% gets us a precision of above 92%, still substantially higher than the 88.4% of the full-coverage baseline parser. The numbers are practically the same for the development and test sets.
Parse-selection scores The parse-selection task has two tunable parameters: the riskiness-threshold R, and the number of risky decisions (K) above which we regard the complete parse-tree as unreliable. For each predictor we performed a grid-search over these parameters with R ranging in value from 0 to 0.5 with increments of 0.01, and K ranging from 0 to 4. For each point we recorded the precision and the sentence-coverage over the development set. We then chose the parameters yielding the best sentence-coverage for a given precision level (we varied the precision levels from 89 to 99 with increments of 0.5). We processed the test-set with the selected parameter values. Figure 3 plots precision against sentence-coverage on the test-set using the (K,R) obtained on the development set.

Figure 2: ROC curves for the various riskiness predictors on the dev set. The True Positive Rate is the percentage of incorrect decisions which were identified as risky, and the False Positive Rate is the percentage of correct decisions which were identified as risky.
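The grid search over (K, R) for parse-selection can be sketched as follows. The data layout (each sentence as a list of (riskiness, is_correct) pairs), the function names, and the reading of "a given precision level" as "precision at least the target" are assumptions made for this illustration.

```python
def evaluate_selection(sentences, r, k):
    """Precision and sentence-coverage when keeping sentences with at most k risky edges.

    sentences: list of sentences, each a list of (riskiness, is_correct) pairs.
    Returns None when no sentence is selected.
    """
    selected = [s for s in sentences
                if sum(1 for risk, _ in s if risk > r) <= k]
    n_tokens = sum(len(s) for s in selected)
    if n_tokens == 0:
        return None
    n_correct = sum(1 for s in selected for _, ok in s if ok)
    return n_correct / n_tokens, len(selected) / len(sentences)


def grid_search(sentences, target_precision):
    """Best (coverage, precision, r, k) reaching the target precision, or None."""
    best = None
    for step in range(51):              # R = 0.00, 0.01, ..., 0.50
        r = step / 100
        for k in range(5):              # K = 0, 1, ..., 4
            result = evaluate_selection(sentences, r, k)
            if result is None:
                continue
            precision, coverage = result
            if precision >= target_precision and (best is None or coverage > best[0]):
                best = (coverage, precision, r, k)
    return best
```

In the setup described above, the search would be run on the development set once per precision level, and the chosen (K, R) pair would then be applied unchanged to the test set.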

8 Riskiness of PP-attachments 

Having developed a model of assessing the riskiness associated with parsing decisions (i.e., the chances of a certain decision being wrong), what does it find to be risky? A complete analysis of the behaviour is well beyond the scope of this paper, but we provide a glimpse into what is possible by inspecting the model's behaviour on a particular case of syntactic ambiguity: PP attachment. Not surprisingly, attaching a preposition to its parent is judged very risky. When sorting POS-tags by the number of times their parent-attachment is judged to be risky, prepositions are at the top of the list. But do the models learn that all PP attachment decisions are risky and abstain from attaching any PP to their parent? Or are some kinds of prepositions riskier than others? Table 2 shows the confusion matrix for prepositions as judged by the edge_higher model with a risk-threshold of 0.15. Table 3 breaks down the numbers by preposition type.

Figure 3: Parse-selection (precision vs. sentence-coverage) results on the test sets based on best (K,R) values obtained on dev set.



reality / model    Risky       Safe
Incorrect          TP: 961     FN: 326
Correct            FP: 1353    TN: 4302

Table 2: Prepositions' riskiness confusion matrix over the dev-set. edge_higher features, riskiness-threshold of 0.15.

Interestingly, while PP attachment is the riskiest phenomenon, most PP attachment cases are correctly judged by the model to be non-risky (4302 cases). 326 other PP attachment cases are judged by the model to be safe, but are incorrect. Finally, the model marks 2314 PP-attachment cases as risky, and 961 of these are indeed parsing mistakes.
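The counts in Table 2 can be turned into rates with a few lines of arithmetic. The sketch below is illustrative only; the function name and the names of the derived quantities are hypothetical, and the inputs are the four cells of the confusion matrix.

```python
def summarize_confusion(tp, fn, fp, tn):
    """Derive rates from a riskiness confusion matrix.

    tp: risky and incorrect, fn: safe but incorrect,
    fp: risky but correct,   tn: safe and correct.
    """
    total = tp + fn + fp + tn
    marked_risky = tp + fp
    marked_safe = tn + fn
    return {
        "marked_risky": marked_risky,
        "risky_that_are_errors": tp / marked_risky,   # share of risky edges that are wrong
        "errors_caught": tp / (tp + fn),              # share of wrong edges flagged as risky
        "precision_on_safe": tn / marked_safe,        # accuracy of the edges that are kept
        "coverage": marked_safe / total,              # share of edges that are kept
        "full_coverage_accuracy": (fp + tn) / total,  # accuracy with no abstention
    }


# Cells of Table 2 (prepositions, edge_higher, riskiness-threshold 0.15).
table2 = summarize_confusion(tp=961, fn=326, fp=1353, tn=4302)
```

Comparing `precision_on_safe` with `full_coverage_accuracy` shows how much accuracy on prepositions is gained in exchange for the reduced `coverage`.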

When breaking down by preposition type, we can observe that "of" is by far the least ambiguous (1457 cases, or 93%, are correctly marked as safe), "in", "for", "on", "as", "at" are the most ambiguous (marked as risky about half of the time) while "by", "that", "from" are in between (25-35% of the cases are judged to be risky).

Preposition    TP     FP     TN      FN    Total
as             41     80     123     19    263
with           34     76     168     12    290
at             55     68     177     21    321
on             77     90     179     16    362
from           32     59     222     17    330
that           31     62     223     16    332
by             35     49     261     9     354
for            120    186    317     28    651
in             245    319    575     75    1214
of             18     40     1457    39    1554

Table 3: Prepositions' riskiness by type over the dev-set. edge_higher features, riskiness-threshold of 0.15. TP: risky/incorrect, FP: risky/correct, TN: safe/correct, FN: safe/incorrect.
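The per-preposition breakdown can be recomputed directly from the counts in Table 3. In the sketch below the dictionary layout and function names are hypothetical; the numbers are copied from the table.

```python
# Counts from Table 3 as (TP, FP, TN, FN) per preposition.
TABLE3 = {
    "as": (41, 80, 123, 19),
    "with": (34, 76, 168, 12),
    "at": (55, 68, 177, 21),
    "on": (77, 90, 179, 16),
    "from": (32, 59, 222, 17),
    "that": (31, 62, 223, 16),
    "by": (35, 49, 261, 9),
    "for": (120, 186, 317, 28),
    "in": (245, 319, 575, 75),
    "of": (18, 40, 1457, 39),
}


def risky_fraction(tp, fp, tn, fn):
    """Share of attachments that the model marks as risky."""
    return (tp + fp) / (tp + fp + tn + fn)


def rank_by_riskiness(table):
    """Prepositions sorted from most to least often marked risky."""
    return sorted(((prep, risky_fraction(*counts)) for prep, counts in table.items()),
                  key=lambda item: item[1], reverse=True)


ranking = rank_by_riskiness(TABLE3)
```

Printing `ranking` gives the fraction of risky attachments for each preposition, which is the quantity the paragraph above describes qualitatively.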

9 Discussion 

We advocate a modified version of the parsing task 
- precision-biased parsing - which favors precision 
over recall by allowing the parser to abstain from de- 
cisions about which it is uncertain. In our view, par- 
tial but highly accurate structural information is in 
many cases more valuable than complete but less ac- 
curate structural information. The precision-biased 
parsing problem is related to confidence estimation, 
that is, attaching reliability scores to model predic- 
tions. 

In order to address the precision-biased parsing 
task we introduce the notion of riskiness of parser 
decisions. On the basis of riskiness assessment, the 
parser can abstain from risky predictions. This gives 
rise to a natural solution to the parse-selection task: 
reliable parse-trees are those associated with few 
risky actions. 

After verifying that disagreement in a parser- 
ensemble is a good indicator of risky edges, we 
presented a novel approach that does not rely on 
a parser-ensemble, but instead learns to predict the 
riskiness involved with individual actions of a sin- 
gle parser. While the method sacrifices more cov- 
erage than the parser-ensemble in order to achieve 
the same level of accuracy, the results are encour- 
aging and demonstrate that a single parsing system 
can monitor the confidence of its own predictions. 



Single parser riskiness assessment turns out to be 
a good indicator of confidence on aggregate: the 
single-parser system is as capable as the ensemble- 
based one at selecting high-quality complete parses. 
Supplementary Material

Figure 4: Precision-biased parse examples of the single-parser systems' predictions on the dev set. The figure has three panels (edge_state, edge_factored and edge_higher, each with a risk threshold of 0.15), each showing the predicted dependency arcs over the same two POS-tagged sentences:

-Root- Mr. Carder also goes through periods when he buys stocks in conjunction with options to boost returns and protect against declines
NNP NNP RB VBZ IN NNS WRB PRP VBZ NNS IN NN IN NNS TO VB NNS CC VB IN NNS

-Root- Prudential currently is seeking approval to offer a new fund offering a return equal to the S&P 500 index plus 5/100 of a percentage point
NNP RB VBZ VBG NN TO VB DT JJ NN VBG DT NN JJ TO DT NNP CD NN CC CD IN DT NN NN




Figure 5: Parse-selection results based on the single-parser risk predictors (precision vs. sentence-coverage) on the dev set. The horizontal axis is precision (%), ranging from 89 to 99.



